The SCTP R package contains the proposed SCTP (Single Cell Tissue Phenotype prediction) method. SCTP provides a valuable approach for analyzing and understanding the cellular malignancy within the tumor microenvironment from an innovative and integrative perspective by combining the essential information from the bulk sample phenotype, single cell composition and cellular special distribution, which would be overlooked in traditional tissue pathological slice. As an automated tissue phenotype prediction model, SCTP facilitates a more profound understanding of tumor microenvironments, enables quantitative characterization of cancer hallmarks, and elucidates the underlying complex molecular and cellular interplay.
In this tutorial, we provide multiple examples to assist you in applying SCTP to real-world applications. It encompasses instructions for estimating the likelihood of colorectal cancer using a pre-trained model and guides on constructing a new SCTP model using datasets supplied by yourself.
We first load in the required package:
library(SCTP)
Please set up a virtual environment named with “env_SCTP,” ensuring it includes the required packages:
In this section, we outline the procedure for utilizing the SCTP-CRC
pretrained model (function SCTP_CRC) to evaluate cell or
spot malignancy in your own datasets.
The input data must be formatted as a Seurat object, particularly for spatial transcriptomics data, where examining the image component is highly recommended for visualization of the output.
For single cell data, in cases where only the counts matrix is
available, you could first use the function
Seurat_preprocess to converted into a Seurat object. This
function provide simplified preprocessing procedures and the output is a
Seurat object.
counts <- read.csv(
file="/Users/w435u/Documents/ST_SC/Method_Compare/data/IR/GSE115978_counts.csv",
header=TRUE,
row.names = 1
)
In this scRNA-seq dataset, each row represents a gene and each column represents a cell. The dimensions of this single-cell data are:
dim(counts)
## [1] 23686 7186
which indicates there are 23,686 genes and 7,186 cells in total. We use the functions provided from the Seurat package to preprocess this data. To simplify the process, we wrapped the Seurat analysis pipeline into the following function:
sc_dataset <- Seurat_preprocess(counts, verbose = F, type="SC")
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
## Found more than one class "dist" in cache; using the first, from namespace 'spam'
## Also defined by 'BiocGenerics'
The output is a Seurat object that contains the required preprocessed counts matrix, as well as other helpful dimensionality reduction results, such as the PCA, t-SNE, and UMAP.
names(sc_dataset)
## [1] "RNA" "RNA_nn" "RNA_snn" "pca" "tsne" "umap"
For the diversity of spatial transcriptomic formats, automatic preprocessing is unavailable from this package. You must initially process your data to create a Seurat object, which should include the SCT-normalized counts matrix and the image data.
Alternatively, you can also provide a Seurat object using your own pipeline, but at least a normalized data (assays$RNA@data) is required. Below we show examples with single-cell RNA-seq data (sc_dataset) and spatial transcriptomic data (st_dataset) respectively.
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_sc_nonimmune_sub.RData") #sc_dataset
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_L1.RData") #st_dataset
Check on single cell data for required information.
!is.null(sc_dataset@assays$RNA@data)
## [1] TRUE
Check on spatial transcriptomic data for required information.
!is.null(st_dataset@assays$SCT$data)
## [1] TRUE
We begin by visualizing the cells, categorized by types as annotated in the original study, presenting only non-immune cells. These are classified into Endothelial cells (E), Fibroblasts (F), and Tumor cells (Tu), which are further subdivided into subclusters shown below:
DimPlot(sc_dataset, group.by = "cluster", reduction="tsne")+ggtitle("Cell type")+
theme(legend.position = "bottom", legend.key.size = unit(2, 'mm'))
Using the provided input, we employ SCTP-CRC to estimate the likelihood of CRC tumor of each cell.
sc_dataset <- SCTP_CRC(my_seurat = sc_dataset)
## [1] "Calculating correlations with SC DATA"
## [1] "Calculating correlations with ST DATA"
## [1] "Predicting malignancy"
The predicted malignancy of each cell is stored as an new annotation “malignancy’ in the metadata of the output Seurat object.
names(sc_dataset@meta.data)
## [1] "orig.ident" "nCount_RNA" "nFeature_RNA"
## [4] "RNA_snn_res.0.6" "seurat_clusters" "X"
## [7] "patients" "sampletag" "organs"
## [10] "percent.mt" "percent.ribo" "log10GenesPerUMI"
## [13] "integrated_snn_res.0.5" "batch" "doublet.score"
## [16] "predicted.doublet" "doublet" "cluster"
## [19] "integrated_snn_res.0.1" "patients_organ" "malignancy"
## [22] "condition"
The results can subsequently be visualized using TSNE or UMAP plots. A value closer to 1 signifies a higher malignancy level in the corresponding spots, whereas a value close to 0 suggests a normal state.
FeaturePlot(sc_dataset, features = "malignancy", reduction="tsne", )+
scale_color_gradientn(colours = col_mal)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
When compared to the original cell type annotations, it is evident that a significant number of tumor cells have been assigned high malignancy scores, while non-tumor cells have been allocated low malignancy scores.
Next, we present an example using spatial transcriptomic data for
prediction. Utilizing a preloaded ST Seurat object, we employ the
SCTP_CRC function to predict the likelihood of tumor
presence in each spot.
st_dataset <- SCTP_CRC(my_seurat = st_dataset)
## [1] "Calculating correlations with SC DATA"
## [1] "Calculating correlations with ST DATA"
## [1] "Predicting malignancy"
Same as single-cell data input, the predicted malignancy of each spot is stored in the annotation “malignancy’ in the output Seurat object.
names(st_dataset@meta.data)
## [1] "orig.ident" "nCount_RNA" "nFeature_RNA"
## [4] "nCount_SCT" "nFeature_SCT" "SCT_snn_res.0.8"
## [7] "seurat_clusters" "sign_L1" "pred_st"
## [10] "pseudotime" "pEMT1" "Proliferation1"
## [13] "Cell_cycle_S1" "Cell_cycle_G2_G2M1" "EMT_angiogenesis1"
## [16] "EMT_transfer1" "EMT_hardness1" "EMT_invades1"
## [19] "tu_sub" "pred_cell" "pred_cell0"
## [22] "neighbor_cell" "neighbor_cell0" "malignancy"
## [25] "condition"
You can then visualize by SpatialFeaturePlot for spatial transcriptomic data. Value closer to 1 indicates higher malignancy of the corresponding spots, while value close to 0 indicates normal state.
SpatialFeaturePlot(st_dataset, features = "malignancy")+
scale_fill_gradientn(colours = col_mal)
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
At the single-cell level, standard downstream analyses typically include differential expression gene (DEG) analysis and pathway analysis, following malignancy predictions. In this tutorial, we focus on the novel downstream analysis performed on spatial spots. Once predictions on the spatial transcriptomic data are acquired, several subsequent analyses were conducted. Following the classification of spatial spots as tumor or normal using SCTP-CRC model, we embarked on a series of downstream analyses tailored to spatial transcriptomics.
The new annotation “condition” describes the tumor state. You can identifiy differentially expressed genes between “Tumor” and “Normal” tisseus by the funtion “FindWarkers” as provided in the package Seurat.
Tumor_enriched <- FindMarkers(st_dataset, ident.1 = "Tumor", group.by = 'condition', verbose = FALSE)
head(Tumor_enriched)
After estimated the tumor likelihood of each spots, users can chose to perform the trajectory analysis with the function ‘SCTP_invasion’. The function will first check if the ‘malignancy’ is already available in the my_seurat metadata. If not, users should first estimate with the function ‘SCTP_CRC’.
The purpose of this step is to estimate the tumor invasion route, therefore the trajectory analysis will only be performed on malignant spots (i.e. malignancy >0.5). The pseudotime of non-malignant spots will be assigned as -Inf and out of the scope for interpretation.
st_dataset <- SCTP_invasion(my_seurat = st_dataset)
## [1] "Estimating tumor pseudotime"
## Registered S3 methods overwritten by 'proxy':
## method from
## print.registry_field registry
## print.registry_entry registry
## | | | 0% | |======================================================================| 100%
## | | | 0% | |======================================================================| 100%
We can visualize the estimated pseudotime:
SpatialFeaturePlot(st_dataset, features = "pseudotime")+
scale_fill_gradientn(colours = col_time)
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
Light color represents spots with a longer period of existence, which can be interpreted as the Root on the histology image. The invasion of the tumor is represented by gradually darked color. Finally, spots colored with dark red are considered as Front.
The cell-cell communication in the spatial context is interpreted as spot-spot communication. To facilitate the presentation, the areas on the tissue were first divided into different regions
Subsequently, the CellChat package was utilized to explore ligand-receptor interactions across spots within these distinct regions. The intensity of these interactions can be depicted via circle plots or heatmaps, with crucial signaling interactions being aggregated and emphasized. The ‘SCTP_cellchat’ function offers a streamlined approach for conducting these various steps in cell-cell communication analysis, simplifying the process for users. The output is a CellChat object, ready for any further analysis.
json_path <- "/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/L1/spatial"
cellchat <- SCTP_cellchat(my_seurat = st_dataset, json_file = json_path)
## [1] "Create a CellChat object from a data matrix"
## Create a CellChat object from spatial imaging data...
## Set cell identities for the new CellChat object
## The cell groups used for CellChat analysis are Front Invasion Normal Root T2
## truncatedMean is used for calculating the average gene expression per cell group.
## [1] ">>> Run CellChat on spatial imaging data using distances as constraints <<< [2024-03-26 16:27:31.445816]"
## The suggested minimum value of scaled distances is in [1,2], and the calculated value here is 1.280518
## [1] ">>> CellChat inference is done. Parameter values are stored in `object@options$parameter` <<< [2024-03-26 16:28:46.012286]"
The function was used to generate circle plots, illustrating the strength of ligand-receptor interactions. And, the function was implemented to present the number of interaction pairs as a heatmap, facilitating an intuitive understanding of the interaction frequency and patterns.
#par(mfrow = c(1,2), xpd=TRUE)
netVisual_circle(cellchat@net$count, vertex.weight = rowSums(cellchat@net$count),
weight.scale = T, label.edge= F, title.name = "Number of interactions")
netVisual_heatmap(cellchat, measure = "count", color.heatmap = "Blues")
## Do heatmap based on a single object
We can visualize several signaling pathways of interests. More details on the cell-cell communication can be found at the chellchat v2. website (https://github.com/sqjin/CellChat).
cellchat@netP$pathways
## [1] "COLLAGEN" "FN1" "LAMININ" "MK"
## [5] "APP" "CEACAM" "CLDN" "CDH"
## [9] "SPP1" "VTN" "CDH1" "THBS"
## [13] "MIF" "ADGRG" "TENASCIN" "CD99"
## [17] "COMPLEMENT" "ApoA" "MHC-II" "PARs"
## [21] "GDF" "SEMA4" "IGFBP" "EGF"
## [25] "CXCL" "ANGPTL" "EPHA" "ApoB"
## [29] "Cholesterol" "ADGRE" "DHEAS" "TGFb"
## [33] "MHC-I" "ApoE" "EPHB" "PECAM1"
## [37] "PROS" "PTPR" "PERIOSTIN" "Androsterone"
## [41] "Prostaglandin" "SEMA3" "HSPG" "NECTIN"
## [45] "VISFATIN" "ICAM" "LIGHT" "JAM"
## [49] "Desmosterol" "DESMOSOME" "OCLN" "VCAM"
## [53] "BMP" "CCL" "PROCR" "ADGRA"
## [57] "AGT" "SEMA5" "AGRN" "THY1"
## [61] "GRN" "PDGF" "GAS" "NOTCH"
## [65] "CSF" "MMP" "ANNEXIN" "CD45"
## [69] "VEGF" "CDH5" "TWEAK" "CD46"
## [73] "VWF" "RELN" "FGF" "CADM"
## [77] "CD96" "HGF" "GAP" "PROC"
## [81] "IL16" "CLEC" "ADGRE5" "ADGRL"
## [85] "PVR" "BRADYKININ" "ESAM" "WNT"
## [89] "NRXN" "SLIT" "KLK" "PTN"
## [93] "ACTIVIN" "CALCR" "SIRP" "IGF"
## [97] "IL1" "FLRT" "UNC5" "CysLTs"
## [101] "NRG" "PLAU" "Netrin" "SEMA6"
## [105] "TIGIT" "IL2" "ALCAM" "CD6"
## [109] "CD86" "PCDH" "CX3C" "TXA2"
## [113] "SELL" "PTPRM"
From the key signaling pathway listed above, we chose to visualize two well-known pathways, one for tumor invasion (WNT pathway) and one for tumor suppression (IGFBP pathway).
pathways.show <- "WNT" # tumor invasion
netVisual_aggregate(cellchat, signaling = pathways.show, layout = "spatial",
edge.width.max = 2, vertex.size.max = 1,
alpha.image = 0.2, vertex.label.cex = 3.5)
pathways.show <- "IGFBP" # tumor suppression
netVisual_aggregate(cellchat, signaling = pathways.show, layout = "spatial",
edge.width.max = 2, vertex.size.max = 1,
alpha.image = 0.2, vertex.label.cex = 3.5)
To elucidate the molecular functional diversity (e.g. cancer hallmarks) across various tumor regions, we utilized the function from the Seurat package, which involved a pre-defined list of genes that represents a distinct functional category and cancer signatures. SCTP provide several modules related to the tumor invasion signaling pathway:
module_name <- "pEMT"
genes_list <- all_genelist[[module_name]]
st_dataset <- AddModuleScore(st_dataset, features = list(genes_list),
name=module_name)
SpatialFeaturePlot(st_dataset, features = paste0(module_name, "1"))
If you are interested in building your own SCTP model (especially for other cancer types), this section provides a demonstration of how to train your own model on your own datasets.
The input materials to train the model are the following
Or you can load the Seurat objects for single-cell and spatial transcriptomic as input.
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_sc100.RData")
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_C1.RData")
Next we load the bulk data, the rows and columns are genes and samples, respectively. The dimensions of the bulk dataset are:
load("/Users/w435u/Documents/ST_SC/OUTPUT_gse44067/bulk_gse44076.RData")
dim(bulk_dataset)
## [1] 49386 196
There are 196 bulk samples, providing gene expression profiles of
paired normal adjacent mucosa and tumor samples from 98 CRC individuals.
The pheno type is provided as Tumor = 1 and
Normal = 0.
table(pheno)
## pheno
## 0 1
## 98 98
With all input sources loaded into the environment, we use the function ‘SCTP’ to train the model. (A dedicated folder is needed to store temporary outputs.)
Tuning parameter is a crucial step and can take some time in order to get a predictive model. Key parameters include:
Below is an example with default parameters
mod <- SCTP(sc_expr=sc_dataset_new, cell_types=filtered_annot$Cell_subtype, st_dataset = st_dataset, bulk_dataset = bulk_dataset, pheno=pheno, out_dir="/Users/w435u/Documents/OUTPUT", my_seed=12345, n.epoch = 500)
The output of the function is a list with the following elements:
They are essential for prediction.
str(mod)
With the trained model, we predict on the new ST dataset:
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_L1.RData")
dim(st_dataset)
st_dataset <- SCTP_pred(SCTP_model = mod, my_seurat=st_dataset)
Even when only single-cell or spatial transcriptomics (ST) data are available, constructing a SCTP model remains feasible. The following example illustrates the process using solely ST data, though similar steps can be adapted for single-cell data. This model is developed based on the spatial transcriptomics data from a sample with primary colorectal cancer (CRC).
First load the ST data and bulk data.
load("/Users/w435u/Documents/ST_SC/OUTPUT_gse44067/bulk_gse44076.RData")
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_C1.RData")
dim(st_dataset)
## [1] 15833 2054
The function SCTP_mono accepts a Seurat object st_dataset, a bulk_dataset, and the phenotype of bulk samples as input.
mod_mono <- SCTP_mono(my_seed=12345, n.epoch = 500, st_dataset = st_dataset, bulk_dataset = bulk_dataset, pheno=pheno, out_dir="/Users/w435u/Documents/OUTPUT")
## [1] "Network and features are well saved."
## [1] "Training your model."
The model includes the estimated coefficient beta and its corresponding matrix, both of which are utilized for prediction purposes.
str(mod_mono)
## List of 2
## $ beta : num [1:2054, 1] -0.336 -0.336 -0.42 -0.32 -0.333 ...
## $ st_expr:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:13705400] 0 1 2 6 7 9 11 13 14 16 ...
## .. ..@ p : int [1:2055] 0 9961 12770 22626 28199 38382 44210 51196 57106 63398 ...
## .. ..@ Dim : int [1:2] 17943 2054
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:17943] "SAMD11" "NOC2L" "KLHL17" "PLEKHN1" ...
## .. .. ..$ : chr [1:2054] "AAACAAGTATCTCCCA-1" "AAACAGAGCGACTCCT-1" "AAACATTTCCCGGATT-1" "AAACCTAAGCAGCCGG-1" ...
## .. ..@ x : num [1:13705400] 2 10 1 6 7 4 1 24 7 6 ...
## .. ..@ factors : list()
To make predictions on another set of spatial data, you should input the pre-trained SCTP single modality model along with your spatial data.
load("/Users/w435u/Documents/ST_SC/DATA_STSC_CAO/Seurat_C3.RData")
st_dataset <- SCTP_mono_pred(SCTP_model = mod_mono, my_seurat=st_dataset)
## [1] "Calculating correlations with ST DATA"
## [1] "Predicting malignancy for all cells or spots"
SpatialFeaturePlot(st_dataset, features = "malignancy")+
scale_fill_gradientn(colours = col_mal)
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.